Fine-Tuning
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning from Human Feedback (RLHF)
- Proximal Policy Optimization (PPO)
- Rejection Sampling Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) Methods
- PEFT methods train only a small set of parameters, which might be a subset of the existing model parameters or a set of newly added parameters.
- These methods differ in parameter efficiency, memory efficiency, training speed, final model quality, and additional inference costs (if any).
- The following PEFT comparison content is taken verbatim from the paper Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning.

Taxonomy of PEFT: a bird's-eye view
- PEFT methods can be classified in multiple ways.
- They may be differentiated by their underlying approach or conceptual framework:
- does the method introduce new parameters to the model, or
- does it fine-tune a small subset of existing parameters?
- Alternatively, they may be categorized according to their primary objective:
- does the method aim to minimize memory footprint or
- only storage efficiency?
- In this section, we begin by presenting a taxonomy based on the former.
Additive methods
- The main idea behind additive methods is augmenting the existing pre-trained model with extra parameters or layers and training only the newly added parameters.
- As of now, this is the largest and most widely explored category of PEFT methods.
- Within this category, two large subcategories have emerged: adapter-like methods and soft prompts.
Adapters
- Adapters are a type of additive parameter-efficient fine-tuning method that involves introducing small fully connected networks after Transformer sub-layers.
- The idea has been widely adopted, and multiple variations of Adapters have been proposed.
- These variations include modifying the placement of adapters, pruning, and using reparametrization to reduce the number of trainable parameters.
- GitHub: https://github.com/adapter-hub/adapters
# A schematic adapter implementation:
def transformer_block_with_adapter(x):
    residual = x
    x = SelfAttention(x)
    x = FFN(x)  # adapter
    x = LN(x + residual)
    residual = x
    x = FFN(x)  # transformer FFN
    x = FFN(x)  # adapter
    x = LN(x + residual)
    return x
- Unlike the transformer FFN block, Adapters usually have a smaller hidden dimension than the input.
- Pfeiffer et al. (2020a) found that inserting the adapter only after the self-attention layer (after normalization) achieves similar performance as using two adapters per transformer block.
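Below is a runnable sketch of a single adapter module in PyTorch. The bottleneck structure (down-projection, nonlinearity, up-projection, plus a residual connection) follows the standard adapter recipe; the names hidden_dim and bottleneck_dim are illustrative choices, not taken from the paper.

# A minimal runnable adapter module (PyTorch):
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        # The bottleneck dimension is much smaller than the hidden dimension,
        # which is what keeps the number of trainable parameters low.
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection lets the adapter learn a small correction
        # on top of the frozen sub-layer output.
        return x + self.up(self.act(self.down(x)))

# Usage: insert after a Transformer sub-layer; train only the adapter weights.
adapter = Adapter(hidden_dim=768, bottleneck_dim=64)
out = adapter(torch.randn(2, 16, 768))  # (batch, seq_len, hidden_dim)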
Soft Prompts
- Soft prompts are a type of additive parameter-efficient fine-tuning method that involves adding a small number of trainable parameters to the pre-trained model.
- Language model prompting aims to control the behavior of a language model by modifying the input text, which typically consists of a task description accompanied by a few in-context examples.
- However, these methods are difficult to optimize and are inherently limited in the number of training examples by the maximum model input length.
- To address these drawbacks, the concept of “soft” prompts was introduced where a part of the model’s input embeddings is fine-tuned via gradient descent.
- This pivots the problem of finding prompts in a discrete space to a continuous optimization problem.
- Soft prompts can be trained for the input layer only or for all layers (the sketch after this list shows the input-layer variant).
- Recent advancements explore how soft prompts can be pre-trained, or how prompts trained for other tasks can be reused, to reduce the computation required for fine-tuning a soft prompt for a new task.
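Below is a minimal sketch of input-layer prompt tuning in PyTorch, assuming the soft prompt is simply prepended to the frozen model's token embeddings; n_prompt_tokens and embed_dim are illustrative names.

# A minimal soft prompt (prompt tuning) sketch:
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_prompt_tokens: int, embed_dim: int):
        super().__init__()
        # The only trainable parameters: one learned embedding per prompt token.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the soft prompt to the (frozen) token embeddings, turning
        # discrete prompt search into continuous optimization.
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage: embed tokens with the frozen model, prepend the soft prompt, then
# run the frozen Transformer; only soft_prompt.prompt receives gradients.
soft_prompt = SoftPrompt(n_prompt_tokens=20, embed_dim=768)
token_embeds = torch.randn(2, 16, 768)  # stand-in for frozen token embeddings
extended = soft_prompt(token_embeds)    # shape: (2, 36, 768)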
Why add parameters?
- Although these methods introduce additional parameters to the network, they achieve significant training time and memory efficiency improvements by reducing the size of the gradients and the optimizer states.
- Note that in the case of Adam (Kingma and Ba, 2015), for every byte of trainable parameters, one extra byte is needed for the gradient, and two more bytes are needed to store the optimizer state: the first and second moments of the gradient.
- In practice, depending on the setup, training a model requires 12-20 times more GPU memory than the model weights (see the back-of-envelope sketch after this list).
- By saving memory on optimizer states and gradients, and by allowing frozen model parameters to be quantized, additive PEFT methods enable fine-tuning of much larger networks or the use of larger microbatch sizes, which improves training throughput on GPUs.
- Moreover, optimizing fewer parameters in distributed setups drastically reduces communication volume.
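A back-of-envelope sketch of the memory arithmetic above, assuming fp32 values and Adam; the 7B parameter count is an arbitrary example, and activation memory is ignored.

# Back-of-envelope GPU memory estimate (fp32, Adam):
params = 7e9                 # trainable parameters (arbitrary example)
bytes_per_value = 4          # fp32
weights = params * bytes_per_value
grads = weights              # one gradient value per trainable parameter
opt_states = 2 * weights     # Adam: first and second moments of the gradient
print(f"weights only:       {weights / 1e9:.0f} GB")                         # 28 GB
print(f"full fine-tuning:   {(weights + grads + opt_states) / 1e9:.0f} GB")  # 112 GB
# Training only 1% of parameters shrinks gradients and optimizer states ~100x:
peft = weights + 0.01 * (grads + opt_states)
print(f"PEFT, 1% trainable: {peft / 1e9:.0f} GB")                            # 29 GB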
Selective methods
- Arguably the earliest example of selective PEFT is fine-tuning only a few top layers of a network.
- Modern approaches are usually based on the layer type or internal structure, such as tuning only the model's biases or only particular rows (see the bias-only sketch after this list).
- An extreme version of selective methods is sparse update methods, which can completely ignore the structure of the model and select parameters individually.
- However, sparse parameter updates present multiple engineering and efficiency challenges, some of which have been tackled in recent research on parameter reconfiguration and NxM sparsity (Holmes et al., 2021).
- Nevertheless, unrestricted unstructured sparsity is still impractical on contemporary hardware.
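Below is a minimal sketch of the structured end of this spectrum: bias-only tuning in the style of BitFit (Ben Zaken et al., 2021), written in PyTorch; the stand-in model is arbitrary.

# A BitFit-style selective fine-tuning sketch:
import torch.nn as nn

def freeze_all_but_biases(model: nn.Module) -> None:
    # Selective PEFT: only bias vectors receive gradients.
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")

# Usage with an arbitrary stand-in model:
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
freeze_all_but_biases(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")  # 3,840 / 4,722,432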
Reparametrization-based methods
- Reparametrization-based methods are a type of PEFT that involves reparametrizing the model in a way that reduces the number of trainable parameters.
- Reparametrization-based parameter-efficient fine-tuning methods leverage low-rank representations to minimize the number of trainable parameters.
- The notion that neural networks have low-dimensional representations has been widely explored in both empirical and theoretical analyses of deep learning.
- Aghajanyan et al. (2020) have demonstrated that fine-tuning can be performed effectively in low-rank subspaces.
- Further, they showed that the size of the subspace needing adaptation is smaller for bigger models and for models pre-trained longer.
- LoRA:
- The most well-known reparametrization-based method is Low-Rank Adaptation, or LoRA (Hu et al., 2021), which employs a simple low-rank matrix decomposition to parametrize the weight update: $\delta W = W_{down} W_{up}$.
- This approach is straightforward to implement and has been evaluated on models with up to 175 billion parameters.
- More recent works have also explored the use of Kronecker product reparametrization ($\delta W = A \otimes B$), which yields a more favorable tradeoff between rank and parameter count.
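Below is a minimal LoRA sketch in PyTorch, following the decomposition above ($\delta W = W_{down} W_{up}$) with the $\alpha / r$ scaling from Hu et al. (2021); the class and argument names are illustrative.

# A minimal LoRA linear layer sketch:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int, alpha: float = 16.0):
        super().__init__()
        # The pre-trained weight stays frozen; only the low-rank factors train.
        self.base = nn.Linear(in_dim, out_dim)
        self.base.requires_grad_(False)
        # W_up is zero-initialized, so delta_W starts at zero and training
        # begins exactly from the pre-trained model.
        self.W_down = nn.Parameter(torch.randn(in_dim, rank) * 0.02)
        self.W_up = nn.Parameter(torch.zeros(rank, out_dim))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank update x @ (W_down @ W_up).
        return self.base(x) + self.scale * (x @ self.W_down @ self.W_up)

# Usage: rank r is much smaller than the weight dimensions, so the trainable
# parameter count drops from in_dim*out_dim to rank*(in_dim + out_dim).
layer = LoRALinear(in_dim=768, out_dim=768, rank=8)
out = layer(torch.randn(2, 16, 768))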
